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Abstract 



Many algorithms have been proposed for fitting network models with communi- 
ties but most of them do not scale well to large networks, and often fail on sparse 
networks. Here we propose a new fast pseudo-likelihood method for fitting the 
stochastic block model for networks, as well as a variant that allows for an arbi- 
trary degree distribution by conditioning on degrees. We show that the algorithms 
perform well under a range of settings, including on very sparse networks, and 
illustrate on the example of a network of political blogs. We also propose spectral 
clustering with perturbations, a method of independent interest, which works well 
on sparse networks where regular spectral clustering fails, and use it to provide an 
initial value for pseudo-likelihood. 

1 Introduction 

Analysis of network data is important in a range of disciplines and applications, appearing in such 
diverse areas as sociology, epidemiology, computer science, and national security, to name a few. 
Network data here refers to observed edges between nodes, possibly accompanied by additional 
information on the nodes and/or the edges, e.g., edge weights. One of the fundamental questions in 
analysis of such data is detecting and modeling community structure within the network. A lot of 
algorithmic approaches to community detection have been proposed, mainly in the physics literature 
(see 1 1 1, |2 1 for reviews). These include various greedy methods such as hierarchical clustering (see 
P i for a review) and algorithms based on optimizing a global criterion over all possible partitions, 
such as normalized cuts |4| and modularity |5|. The statistical learning community has been more 
focused on model-based methods, which postulate and fit a probabilistic model for a network with 
communities. These include the popular stochastic block model |6|, its extensions to include varying 
degree distributions within communities [TJ and overlapping communities ||8||91, and various latent 
variable models |10 H]. 

The stochastic block model is perhaps the most commonly used and best studied model for com- 
munity detection. For a network with n nodes defined by its n x n adjacency matrix A, this model 
postulates that the true node labels c — {ci, - ■ ■ , c„) e {1, • • • , /C}" are drawn independently from 
the multinomial distribution with parameter tt = (tti, • • • , ttk), where K is the number of commu- 
nities, assumed known. Conditional on the labels, the edge variables Aij for i < j are independent 
Bernoulli variables with 



E[Ay|c] =P,,,^ 



(1) 
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where P — [Pab] is a K x K symmetric matrix. The network is undirected, so Aji — Aij, and 
An ~ (no self-loops). The problem of community detection is then to infer the node labels c from 
A, which typically also involves estimating tt and P. 

There are many extensions of the block model but we will only mention one here that we will use 
later in the paper The block model implies the same expected degree for all nodes within a commu- 
nity, which excludes networks with "hub" nodes commonly encountered in practice. The degree- 
corrected block model |7| removes this constraint by replacing ([T]) with E[Ay |c] = didjPacj , where 
9i's are node degree parameters which satisfy an identifiability constraint. If the degree parameters 
only take on a discrete number of values, one can think of the degree-corrected block model as a 
regular block model with a larger number of blocks, but that loses the original interpretation of com- 
munities. In |7| the Bernoulli distribution for Aij was replaced by the Poisson, primarily for ease of 
technical derivations, and in fact this is a good approximation for a range of networks [IZj. 

Fitting block models is non-trivial, especially for large networks, since in principle the problem 
of optimizing over all possible label assignments is NP-hard. In the Bayesian framework, Markov 
Chain Monte Carlo methods have been developed llJ' T?! but they only work for networks with 
a few hundred nodes. Variational methods have also been developed and studied (see for example 
ll8][l5]|T6l), and are generally substantially faster than the Gibbs sampling involved in MCMC, but 
still do not scale to the order of a million nodes. A profile likelihood approach was proposed in 
ifTTl : since for a given label assignment parameters can be estimated trivially by plug-in, they can 
be profiled out and the resulting criterion can be maximized over all label assignments by greedy 
search. The same method is used in |7| to fit the degree-corrected block model. The speed of 
these algorithms depends on exactly what search method is used and the number of iterations it is 
run for, but again these generally work well for thousands but not millions of nodes. A method of 
moments approach was proposed in (181, for a large class of network models that includes the block 
model as a special case. The generality of this method is an advantage, but it involves counting all 
occurrences of specific patterns in the graph, which is computationally challenging beyond simple 
special cases. Some faster approximations for block model fitting based on spectral representations 
are also available |fT9ll20l . but the properties of these approximations are only partially known. 

In this paper, we propose a new fast pseudo-likelihood algorithm for fitting the block model, as 
well as a variation conditional on node degrees that allows for fitting networks with highly variable 
node degrees within communities. The idea of pseudo-likelihood dates back to |21 1, and in general 
amounts to ignoring some of the dependency structure of the data in order to simplify the likelihood 
and make it more tractable. The main feature of the adjacency matrix we ignore here is its symmetry; 
we also apply block compression, that is, divide the nodes into blocks and only look at the likelihood 
of the row sums within blocks. This leads to an accurate and fast approximation to the block model 
likelihood, which allows us to easily fit networks with tens of millions of nodes. We also propose a 
regularized version of spectral clustering, a new clustering method of independent interest which we 
use to initialize pseudo-likelihood. For sparse networks, regular spectral clustering often performs 
very poorly, likely due to the presence of many disconnected components. We perturb the network 
by adding additional weak edges to connect these components, resulting in regularized spectral 
clustering which performs well under a wide range of settings. In the rest of the paper, we lay out 
the algorithms, and demonstrate their performance on a range of simulated networks as well as on 
a network of political blogs. The theoretical properties of the algorithms are postponed to a later 
journal paper 

2 Algorithms 

2.1 Pseudo-likelihood 

The joint likelihood of A and c could in principle be maximized via the expectation-maximization 
(EM) algorithm, but the E-step involves optimizing over all possible label assignments, which is in 
principle NP-hard. Instead, we introduce an initial labeling vector e — (ei , . . . , e„) which partitions 
the nodes into K groups (each takes values in {1, . . . , K}). Note that for convenience we partition 
into the same number of groups as we assume to exist in the true model, but in principle the same 
idea can be applied with a different number of groups; in fact dividing the nodes into n groups with 
a single node in each group instead gives an algorithm equivalent to that of |22|. 
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The main quantity we work with are the block sums along the columns 



(2) 



j 



for i — 1, . . . ,n, k = 1, . . . , K. Let bi — (6 a , . . . , biK)- Further, let R be the K x K matrix with 
entries {Rka} given by 



Let Rkm be the fcth row of R, and let P,i be the kh column of P. Let Xik = nRk,P,i and A = {\ik}- 

Our approach is based on the following key observations: for each node i, conditional on labels 
c = (ci, . . . , c„) with a = I, 

(A) {bii, ■ ■ ■ , biK} are mutually independent; 

(B) bik, a sum of independent Bernoulli variables, is approximately Poisson with mean Xik- 

With true labels {ci} unknown, each bi can be viewed as a mixture of Poisson vectors. A necessary 
condition for the identifiability of parameters (vr, RP), or equivalently (tt, A), is that the means of 
the Poisson mixture components are different, i.e., A has no two identical rows. This has implica- 
tions for the theoretical analysis of the method, but we do not focus on this here. 

By ignoring the dependence among {bi, i = 1, . . . , n}, treating {a} as latent variables, and setting 
A; — J2k ^'fc' ^'^^ write the pseudo log-likelihood as follows (up to a constant): 



A pseudo-likelihood estimate of (vr, A) can then be obtained by maximizing £pl{t^, A; {bi}). This 
can be done via the EM algorithm for mixture models, a fairly standard tool, which alternates updat- 
ing parameter values with updating probabilities of node labels. Once the EM converges, we update 
the initial partition vector e to the most likely label for each node as indicated by EM, and repeat 
this process for a fixed number of iterations T. 

For any labeling e, we use the following notation: nk{e) = J^i ^i^i ^ "-feK^) — rik{e)ni{e) if 
k^l, nkk{e) = nk{e){nk{e) - 1), and Oki{e) = ^ij = k, = I). We will suppress the 
dependence on e whenever there is no ambiguity. 

The pseudo-likelihood algorithm. Initialize labels e, and let ni = ni/n, R ~ diag(7ri, . . . , tt^), 
Pik = Oik/nik, Xik = nRk,P,i, P = {Pik} and A = {A;fe}. Then repeat T times: 

1. Compute the block sums {bn} according to (|2|l, 

2. Using current parameter estimates n and A, estimate probabilities for node labels by 




(3) 




(4) 



ffii = FpL(ci ^l\bi) 



Ylm=l exp(&iTn log Aim 



Efcil rim^l exp(6jm log Xk: 



3. Given label probabilities, update parameter values as follows: 




n 



4. Return to step 2 unless the parameter estimates have converged. 

5. Update labels by — arg max/ tth and return to step 1 . 

6. Update P as follows: Pik = ^ij^u'^jk) /nik{e). 
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In practice, in step 6 we only include the terms corresponding to ttu greater than some small thresh- 
old. The EM method is fitting a valid mixture model and is thus guaranteed to converge to a sta- 
tionary point of the objective function |23 1. Another option is to update labels after every parameter 
update (that is, skip step 4.) We have found empirically that the algorithm above is more stable, and 
converges faster In general, we only need a few label updates until convergence, and even using 
T = 1 (one-step label update) ) gives reasonable results with a good initial value. The choice of the 
initial value of e, on the other hand, can be important; see more on this in Section l273] 



2.2 Pseudo-likelihood conditional on node degrees 

For networks with hub nodes or those with substantial degree variability within communities, the 
block model can provide a poor fit, essentially dividing the nodes into low-degree and high-degree 
groups 171 [24l . The extension of the block model designed to cope with this situation, the degree- 
corrected block model 1 7 1 , has an extra degree parameter to be estimated for every node, and writing 
out a pseudo-likelihood that lends itself to an EM-type optimization is more complicated. However, 
there is a simple alternative: consider the pseudo-likelihood conditional on the observed node de- 
grees. Whether these degrees are similar or not will not then matter, and the fitted parameters will 
reflect the underlying block structure rather than the similarities in degrees. 

The conditional pseudo-likelihood is again based on a simple observation: 

(C) If random variables Xk are independent Poisson with means fik, their distribution condi- 
tional on J2k -^k is multinomial. 

Applying this observation to the variables (6^1 , • • • , ), we have that their distribution conditional 
on labels c with Ci ~ I and the node degree di — J2k is multinomial with parameters 9ik — 
The conditional log pseudo-likelihood (up to a constant) is then given by. 



^cPL(7r, 8; {bj) = V log V TT (5) 




and the parameters can be obtained by maximizing this function via the EM algorithm for mixture 
models, as before. We again repeat the EM for a fixed number of iterations updating the initial 
partition vector after the EM has converged. The algorithm is then the same as that for unconditional 
pseudo-likelihood, with steps 2 and 3 replaced by: 



nK nbir, 



2'. Based on current estimates tt and {6ik}, let 

nu=¥cPLic, = l\b,) = 

Z^fc=l llm=l "km 

3'. Given label probabilities, update parameter values as follows: 

. 1 -A . p. E»^»i^»fc 



2.3 Initializing the partition vector 

We now turn to the question of how to initialize the partition vector e. Note that the full likelihood, 
pseudo-likelihoods and ^cpl, and other standard objective functions such as modularity can all 
be multi-modal. The numerical results in Section [3] suggest that the initial value cannot be entirely 
arbitrary, but the results are not too sensitive to it. We will quantify this further in Section |3l here 
we describe the two options we use as initial values, both of which are of independent interest as 
clustering algorithms for networks. 



2.3.1 Clustering based on 1- and 2-degrees 

One of the simplest possible ways to group nodes in a network is to separate them by degree, say by 
one-dimensional JsT-means clustering applied to the degrees as in Il25l . This only works for certain 
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types of block models identifiable from their degree distributions, and in general if-means does 
not deal well with data with many ties, which is the case with degrees. Instead, we consider two- 

(2) (2) 

dimensional iC-means clustering on the pairs {di,d\ ), where d\ is the number of paths of length 
2 from node i, which can be obtained by summing the rows of A^. 



2.3.2 Spectral clustering with perturbations 

A more sophisticated clustering scheme is based on spectral properties of the adjacency matrix 
A = {Aij} or its graph Laplacian. Let D — diag(o?i, . . . , dn) be diagonal matrix collecting node 
degrees. A common approach is to look at the eigenvectors of the normalized graph Laplacian L = 
D^^/^AD^^^^, choosing a small number, say r = K ~ 1, corresponding to r largest (in absolute 
value) eigenvalues, with the largest eigenvalue omitted (see, e.g., |4|). These vectors provide an r- 
dimensional representation for nodes of the graph, on which we can apply K-means to find clusters; 
this is one of the versions of spectral clustering, which was analyzed in the context of the block 
model by 120|. 

We found that this version of spectral clustering tends to do poorly at community detection when 
applied to sparse graphs, say, with expected degree A < 5. The r-dimensional representation seems 
to collapse to a few points, likely due to the presence of many disconnected components. We have 
found, however, that a simple modification performs surprisingly well even for values of A close to 1 . 
The idea is to connect all disconnected components which belong to the same community by adding 
artificial "weak" links. To be precise, we "regularize" the adjacency matrix A by adding a/px A/n 
multiplied by the adjacency matrix of an Erdos-Renyi graph on n node with edge probability p, 
where a is a constant. We found that, empirically, a/p — 0.25 works well for the range of n 
considered in our simulations, and that the results are essentially the same for all p > 0.1 Thus we 
make the simplest and computationally cheapest choice of p = 1, adding a constant matrix of small 
values to the original adjacency matrix. The rest of the steps, i.e., forming the Laplacian, obtaining 
the spectral representation and applying K-means, are performed on this regularized version of A. 
We will refer to this algorithm as spectral clustering with perturbations (SCP), as we perturb the 
network by adding new low-weight "edges". 



3 Numerical results 



Here we investigate the performance of both the unconditional and conditional pseudo-likelihood 
algorithms on simulated networks, as well as that of spectral clustering with perturbations. We sim- 
ulate two scenarios, one from the regular stochastic block model and one from the degree-corrected 
block model, to assess the performance in the presence of hub nodes. Throughout this section, we 
fix K ^ 3 and tt ~ (1/3, 1/3, 1/3). Conditional on the labels, the edges are generated as inde- 
pendent Bernoulli variables with probabilities proportional to OiOjPij. The parameters 9j are drawn 
independently from the disti'ibution of 8 with P(8 = 0.2) = 7, P(9 = 1)^1-7. We do not 
enforce the identifiability scaling constraint on 9 at this point as it is absorbed into the scaling of 
the matrix P in (|6]l below. We consider two values of 7: 7 = 0, which corresponds to the regular 
block model, and 7 — 0.9, which corresponds to a network where 10% of the nodes can be viewed 
as hubs. 

The matrix P is constructed as follows. It is controlled by two parameters: the "out-in-ratio" /3 
||26l, which we will vary from to 0.2, and the weight vector w, which determines the relative 
degrees within communities. We consider two values of w: w = (1, 1, 1) (no information about 
communities is contained in node degrees) and w = (1, 5, 10) (degrees themselves provide relevant 
information for clustering). If /3 = 0, we set = diag(?x;), a diagonal matrix. Otherwise, we set 
the diagonal of Xo jS^'^w and set all off-diagonal elements to 1. We then fix the overall expected 
network degree A, which is the natural parameter to control flTl and which we will vary from 1 to 
15. Then we rescale P'"' to obtain this expected degree, giving the final P 

P = - p(o) (6) 

(n- l)(^^P(o)^)(Ee)2 ■ ^ ' 

To compare our results to the true labels, we will use normalized mutual information (NMI). One 
can think of the confusion matrix P as a bivariate probability distribution, and of its row and col- 
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umn sums and R+j as the corresponding marginals. Then the NMI is defined by (271 as 
NMI(c, e) = — ^ Rij log j^.^j^^ (X^z j log-Ry )^^, and is always a number between and 1 
(perfect match). It is useful to have a few benchmark values of NMI for reference: for example, for 
large n, matching 50%, 70%, and 90% of the labels correspond to values of NMI of approximately 
0.12, 0.26, and 0.58, respectively. 

All figures show the performance of the following methods: A'-means clustering on 1- and 2-degrees 
(DC), spectral clustering (SC), spectral clustering with perturbations (SCP), unconditional pseudo- 
likelihood (UPL) initialized with either DC or SCP, and conditional pseudo-likelihood (CPL), with 
the same two initial values for labelings. The number of outer iterations for UPL and CPL is set to 
T = 20; n. A, 7 and the number of replications N are specified in the figures. 




Figure 1 : The NMI between true and estimated labels as a function of "out-in-ratio" /3. 

Figures [T] and |2] show results on estimating the node labels with varying 13 and A, respectively. 
Generally, smaller /3 and larger A make the problem easier, as we expect. In principle, degree- 
based clustering gives no information about the labels with uniform weights w, and only a moderate 
amount of information with non-uniform weights, so it serves as an example of a poor starting value 
for pseudo-likelihood. Regular spectral clustering performs well with uniform weights, but very 
poorly with non-uniform weights; we conjecture that this is due a limitation of A'-means. Spectral 
clustering with perturbation, on the other hand, performs very well in all scenarios. Apart from 
being a useful general method on its own, it also serves as an example of a good starting value for 
pseudo-likelihood. 

Figures [U and |2] show that pseudo-likelihood achieves large gains over a poor starting value, giving 
surprisingly good results even when starting from the uninformative degree clustering in the case of 
w — (1, 1, 1). One exception is unconditional pseudo-likelihood with 7 = 0.9 and w — (1, 1, 1), 
which shows that conditioning is necessary to accomodate variation in degrees when the starting 
value is not very good. When spectral clustering with perturbation is used as a starting value, which 
is akeady very good, UPL and CPL do not have much room to do better, although UPL still provides 
a noticeable improvement, being overall the best method when initialized with SCP. It appears that 
a good starting value overcomes the limitations of the regular block model for networks with hubs, 
effectively ruling out the competing solution which divides nodes by degree. 
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Figure 2: The NMI between true and estimated labels as a function of average expected degree A. 




Finally, Figure[3]shows run times for all the methods for the case of the regular block model (7 — 0) 
with different community weights {w = (1, 1, 1) and w = (1, 5, 10)). The times shown for UPL 
and CPL do not include the time to compute the initial value, which is shown separately. For the 
case w — (1, 1, 1), all methods take roughly the same amount of time. For the case w ~ (1, 5, 10), 
spectral clustering (SC) takes considerably more time than the rest. On the other hand, SCP takes 
nearly the same time as it takes for w = (1,1,1), and it slightly outperforms DC for larger values 
of n. This might be explained, in part, by the sparse matrix multiplication required for DC, which is 
both time and memory-consuming for large n. Generally, SCP provides an excellent starting value, 
with low computational complexity in a variety of situations. 
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4 Example: a political blogs network 



This dataset on political blogs was compiled by f28l soon after the 2004 U.S. presidential election. 
The nodes are blogs focused on US politics and the edges are hyperlinks between these blogs. Each 
blog was manually labeled as liberal or conservative by ||28| . and we treat these as true community 
labels. Following |7|, we ignore directions of the hyperlinks and analyze the largest connected 
component of this network, which has 1222 nodes and the average degree of 27. The distribution of 
degrees is highly skewed to the right (the median degree is 13, and the maximum is 351). 




Figure 4: Political blogs data: true labels, spectral clustering with perturbations (SCP, a = 0.25), 
unconditional and conditional pseudo-likelihoods (UPL and CPL) initialized with SCP. Node size is 
proportional to log degree. 

The results in Figure |4] show that the conditional pseudo-likelihood produces a result closest to the 
truth, as one would expect in view of highly variable degrees. Its result is also very close to those 
obtained by profile maximum likelihood for the degree-corrected block model and by two different 
modularities ||7][24l. Unconditional pseudo-likelihood, on the other hand, puts high-degree nodes 
in one group and low-degree nodes in the other This is very close to the block model solution 
||7 |, which is a confirmation of unconditional pseudo-likelihood converging to the true block model 
estimate in this case. Spectral clustering with perturbations, with the default setting a — 0.25, does 
something in between, misclassifying most of the low-degree nodes in one community (33% overall 
classification error), but yet provides a good enough starting value for CPL to recover a grouping 
close to the truth (5% classification error). Interestingly, a smaller a used in SCP, e.g., a ~ 0.01, 
gives the same solution as CPL in this case. We hypothesize that this happens here because we only 
work with the largest connected component; dropping edges at random until the average degree was 
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5 and applying SCP with a = 0.25 to this much sparser network resulted in misclassifying only 6% 
of the nodes, confirming the hypothesis that the main advantage of SCP is correcting the problem 
caused by many small discormected components. Regular spectral clustering does poorly, possibly 
because with K = 2 communities we only keep one eigenvector and K-mews in one dimension 
often has problems; if we keep a larger number of eigenvectors, say 5, spectral clustering performs 
as well as CPL. 

5 Discussion 

We have demonstrated empirically that the proposed algorithms provide fast and accurate commu- 
nity detection for a range of settings, including large and sparse networks. The pseudo-Ukelihood 
approximations we rely on have a long history of empirical success in statistics, but are not well 
studied theoretically. In ongoing work we are investigating the theoretical properties of the pseudo- 
likelihood algorithms as the expected degree A ^ oo with n, as well as the theoretical properties of 
spectral clustering with perturbations. Developing a data-driven method for selecting the perturba- 
tion tuning parameter a is also a topic for future work. 
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